Extracting gradient features from neural networks

ABSTRACT

A method for extracting a representation from an image includes inputting an image to a pre-trained neural network. The gradient of a loss function is computed with respect to parameters of the neural network for the image. A gradient representation of the image is extracted based on the computed gradients and can be used, for example, for classification or retrieval.

BACKGROUND

The exemplary embodiment relates to image representation and finds particular application in connection with a system and method for representing images using weight gradients extracted from a neural network.

Image representations are widely used for image classification (also referred to as image annotation), which involves describing an image with one or multiple pre-determined labels, and for similarity computation. One form of representation is the bag-of-visual-words (BOV). See, Sivic, et al., “Video Google: A text retrieval approach to object matching in videos,” ICCV, vol. 2, pp. 1470-1477, 2003; Csurka, et al., “Visual categorization with bags of keypoints,” ECCV SLCV workshop, pp. 1-22, 2004. The BOV entails extracting a set of local descriptors, encoding them using a visual vocabulary (i.e., a codebook of prototypes), and then aggregating the codes into an image-level (or region-level) descriptor. These descriptors can then be fed to classifiers, typically kernel classifiers such as SVMs. Approaches which encode higher order statistics, such as the Fisher Vector (FV) (Perronnin, et al., “Fisher kernels on visual vocabularies for image categorization,” CVPR, pp. 1-8, 2007, hereinafter, “Perronnin 2007”; and Perronnin, et al., “Improving the fisher kernel for large-scale image classification,” ECCV, pp. 143-156, 2010, hereinafter, “Perronnin 2010”), led to improved results on a number of image classification tasks. See, Sanchez, et al., “Image classification with the fisher vector: Theory and practice,” IJCV, 2013.

Convolutional Networks (ConvNets) have also been used for labeling images. See, Krizhevsky, et al., “ImageNet classification with deep convolutional neural networks,” NIPS, pp. 1106-1114, 2012, hereinafter, “Krizhevsky 2012”; Zeiler, et al., “Visualizing and understanding convolutional networks,” ECCV, pp. 818-833, 2014, hereinafter, “Zeiler 2014”; Sermanet, et al., “OverFeat: Integrated recognition, localization and detection using convolutional networks,” ICLR, 2014; Simonyan, et al., “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv 1409.1556, 2014, hereinafter, “Simonyan 2014.” ConvNets are trained in a supervised fashion on large amounts of labeled data. These models are feed-forward architectures involving multiple computational layers that alternate linear operations, such as convolutions or average-pooling, and non-linear operations, such as max-pooling and sigmoid activations. The end-to-end training of the large number of parameters inside ConvNets, from pixel values to the specific end-task, is a source of their usefulness.

ConvNets have recently been shown to have good transferability properties when used as “universal” feature extractors. Yosinski, et al., “How transferable are features in deep neural networks?” NIPS, pp. 3320-3328, 2014. If an image is fed to a ConvNet, the output of one of the intermediate layers can be used as a representation of the image. Several methods have been proposed. See, for example, Donahue, et al., “DeCAF: A deep convolutional activation feature for generic visual recognition,” ICML, 2014, hereinafter, “Donahue 2014”; Oquab, et al., “Learning and transferring mid-level image representations using convolutional neural networks,” CVPR, pp. 1717-1724, 2014, hereinafter, “Oquab 2014”; Zeiler 2014; Chatfield, et al., “Return of the devil in the details: delving deep into convolutional nets,” BMVC, 2014, hereinafter, “Chatfield 2014”; Razavian, et al., “CNN features off-the-shelf: An astounding baseline for recognition,” CVPR Deep Vision Workshop, pp. 512-519, 2014, hereinafter, “Razavian 2014.” To use these representations in a classification setting, a linear classifier is typically used.

Hybrid approaches have also been proposed which combine the benefits of deep learning using ConvNets with “shallow” bag-of-patches representations that are based on higher-order statistics, such as the FV. For example, it has been proposed to stack multiple FV layers, each defined as a set of five operations: i) FV encoding, ii) supervised dimensionality reduction, iii) spatial stacking, iv) l₂ normalization, and v) PCA dimensionality reduction. When combined with the original FV, such networks can lead to significant performance improvements in image classification. See, Simonyan, et al., “Deep Fisher Networks for Large-scale Image Classification,” NIPS, 2013. Improvements on the FV framework have been achieved by jointly learning the SVM classifier and the GMM visual vocabulary. Sydorov, et al., “Deep Fisher kernels—End to end learning of the Fisher kernel GMM parameters,” CVPR, pp. 1402-1409, 2014. The gradients corresponding to the SVM layer are back-propagated to compute the gradients with respect to the GMM parameters. Good results on a number of classification tasks have been obtained by extracting mid-level ConvNet features from large patches, embedding them using VLAD (vector of locally aggregated descriptors) encoding (an extension of the Bag-of-Words representation), and aggregating them at multiple scales. See, Gong, et al., “Multi-scale orderless pooling of deep convolutional activation features,” ECCV, pp. 392-407, 2014.

The present system and method provide an efficient way to use ConvNets for generating representations that are particularly useful for computing similarity.

INCORPORATION BY REFERENCE

The following reference, the disclosure of which is incorporated herein by reference in its entirety, is mentioned:

U.S. application Ser. No. 14/691,021, filed Apr. 20, 2015, entitled FISHER VECTORS MEET NEURAL NETWORKS: A HYBRID VISUAL CLASSIFICATION ARCHITECTURE, by Florent C. Perronnin, et al., describes an image classification method which includes generating a feature vector representing an input image by extracting local descriptors from patches distributed over the input image and generating a classification value for the input image by applying a neural network (NN) comprising an ordered sequence of layers to the feature vector, where each layer of the ordered sequence of layers is applied by performing operations including a linear vector projection and a non-linear vector transformation.

U.S. application Ser. No. ______, filed contemporaneously herewith, entitled LATENT EMBEDDINGS FOR WORD IMAGES AND THEIR SEMANTICS, by Albert Gordo, et al.

BRIEF DESCRIPTION

In accordance with one aspect of the exemplary embodiment, a method for extracting a representation from an image includes inputting an image to a pre-trained neural network. The gradient of a loss function is computed with respect to parameters of the neural network for the image. A gradient representation of the image is extracted, based on the computed gradients.

At least one of the computing and the extracting may be performed with a processor.

In accordance with another aspect of the exemplary embodiment, a system for extracting a representation from an image includes memory which stores a pre-trained neural network. A prediction component predicts labels for an input image using a forward pass of the neural network. A gradient computation component computes the gradient of a loss function with respect to parameters of the neural network for the image, based on the predicted labels, in a backward pass of the neural network. A gradient representation generator extracts a gradient representation of the image based on the computed gradients. An output component outputs the gradient representation or information based thereon. A processor in communication with the memory implements the gradient computation component and the prediction component.

In accordance with another aspect of the exemplary embodiment, a method for extracting a representation from an image includes generating a vector of label predictions for an input image in a forward pass of a pre-trained neural network. An error vector is computed, based on differences between the vector of label predictions and a standardized prediction vector. In a backward pass of the neural network, the gradient of a loss function with respect to parameters of the neural network is computed for the image, based on the error vector. A gradient representation of the image is extracted, based on the computed gradients. The gradient representation or information based thereon is output.

At least one of the computing and the extracting may be performed with a processor.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a functional block diagram of a system for generating a representation of an image based on weight gradients derived from a convolutional network, in accordance with one aspect of the exemplary embodiment;

FIG. 2 is a flow chart illustrating a method for generating a representation of an image based on the weight gradients, in accordance with another aspect of the exemplary embodiment;

FIG. 3 illustrates an adaptation of a convolutional network to extraction of weight gradients; and

FIG. 4 provides a summary of notations for a fully-connected layer k parameterized by a weight matrix W_k, with input x_{k-1}, output y_k before the non-linearity, and output x_k after the non-linearity σ.

DETAILED DESCRIPTION

Aspects of the exemplary embodiment relate to a system and method for generating image representations and using the image representations to generate information, such as performing classification or image retrieval tasks. The system and method make use of a trained neural network, specifically, a Convolutional Network (ConvNet), generated by end-to-end learning of deep feed-forward models from raw pixel values. The learned neural network is used to derive gradient features, using backpropagation in the neural network. The gradient representation derived from these features corresponds to a structured matrix that facilitates efficient similarity computation.

The Fisher Kernel (FK) involves deriving a kernel from an underlying generative model of the data by taking the gradient of the log-likelihood with respect to the model parameters. In contrast, in the exemplary method described herein, the gradient features are computed based on a cross-entropy criterion measured between predicted class probabilities output by a neural network and a standardized prediction vector (e.g., an equal probability output).

The exemplary gradient representation is in the form of a matrix which allows the trace kernel to be used to measure the similarity between two such representations of different images. The trace kernel can be decomposed into a product of two sub-kernels. The first is a dot-product between the outputs of the intermediate layers, as computed during the forward pass, and is equivalent to a standard heuristic kernel. The second sub-kernel is a dot-product between quantities computed during the backward pass.

The neural network-based gradient representation can lead to consistent improvements with respect to alternative methods that represent an image using only quantities computed during the forward pass.

With reference to FIG. 1, a computer-implemented system 10 for generating a gradient-based representation 12 of an input object, such as an image 14, is shown. The representation 12 is based on weight gradients, as described in detail below. The system includes memory 16 which stores instructions 18 for performing the method illustrated in FIG. 2 and a processor 20 in communication with the memory for executing the instructions. The system 10 may include one or more computing devices 22, such as the illustrated server computer. One or more input/output devices 24, 26 allow the system to communicate with external devices, such as an image capture device 28, or other source of an image, via wired or wireless links 30, such as a LAN or WAN, such as the Internet. The capture device 28 may include a camera, which supplies the image 14, such as a photographic image or frame of a video sequence, to the system 10 for processing. Hardware components 16, 20, 24, 26 of the system communicate via a data/control bus 32.

The illustrated instructions 18 include a neural network (NN) training component 40, a gradient representation generator 42, a prediction component 44, and a gradient computation component 46. The instructions may further include one or more representation processing components, such as a classifier 48 and/or a similarity computation component 50, and an output component 52.

The NN training component 40 trains a neural network (NN) 56, such as a ConvNet. The neural network includes an ordered sequence of supervised operations (i.e., layers) that are learned on a set of labeled training objects 58, such as images and their true labels 59. The training image labels 59 are drawn from a set of two or more predefined class labels. In an illustrative embodiment, where the input image 14 includes a vehicle, the set of labeled training images 58 comprises a database of images of vehicles, each labeled to indicate a vehicle type using a labeling scheme of interest (such as class labels corresponding to “passenger vehicle,” “light commercial truck,” “semi-trailer truck,” “bus,” etc., although finer-grained labels or broader class labels are also contemplated). The supervised layers of the neural network 56 are trained on the training images 58 and their labels 59 to generate a prediction 60 (e.g., in the form of class probabilities) for a new, unlabeled image, such as image 14. In some embodiments, the neural network 56 may have already been pre-trained for this task, and thus the training component 40 can be omitted.

The representation generator 42 uses the trained neural network 56 to obtain a gradient representation 12 of the input image 14. First, the prediction component 44 generates class predictions 60 for the input image 14, using a forward pass of the NN layers, and computes a set of errors (error vector) 62 by computing the differences between the predictions 60 and a standardized set of predictions 63, which can be an equal-valued or otherwise fixed-valued vector. The computed error 62 is used by the gradient computation component 46 to compute a matrix of weight gradients 66, 68, and/or 70 in a backward pass of one or more of the layers of the NN 56. The gradient representation component generates a representation 12 for the image based on the matrix of weight gradients. In some embodiments, the representation is, or is based on, the matrix of weight gradients itself. In other embodiments, the representation 12 can be a function of the matrix of weight gradients and features previously extracted from the image 14 in the forward pass of the neural network. The image gradient representation 12 may be classified by the previously-learned classifier 48 or used by the similarity computation component 50 to compute a similarity with the representation(s) of another image or images, e.g., for performing image retrieval.

The output component 52 outputs information 72, such as the representation 12 of the image 14, a label (or label probabilities) for the image, output by the classifier 48, and/or one or more similar images identified from a collection of images based on similarity of their respective gradient representations 12, output by the similarity computation component 50, or other information based thereon.

The computer system 10 may include one or more computing devices 22, such as a PC, such as a desktop, a laptop, palmtop computer, portable digital assistant (PDA), server computer, cellular telephone, tablet computer, pager, image capture device, such as camera 28, combination thereof, or other computing device capable of executing instructions for performing the exemplary method.

The memory 16 may represent any type of non-transitory computer readable medium such as random access memory (RAM), read only memory (ROM), magnetic disk or tape, optical disk, flash memory, or holographic memory. In one embodiment, the memory 16 comprises a combination of random access memory and read only memory. In some embodiments, the processor 20 and memory 16 may be combined in a single chip. Memory 16 stores instructions for performing the exemplary method as well as the processed data 60, 62, 66, 68, 70, 12.

The network interface 24, 26 allows the computer to communicate with other devices via a computer network, such as a local area network (LAN) or wide area network (WAN), or the Internet, and may comprise a modulator/demodulator (MODEM), a router, a cable, and/or an Ethernet port.

The digital processor device 20 can be variously embodied, such as by a single-core processor, a dual-core processor (or more generally by a multiple-core processor), a digital processor and cooperating math coprocessor, a digital controller, or the like. The digital processor 20, in addition to executing instructions 18, may also control the operation of the computer 22.

The term “software,” as used herein, is intended to encompass any collection or set of instructions executable by a computer or other digital system so as to configure the computer or other digital system to perform the task that is the intent of the software. The term “software” as used herein is intended to encompass such instructions stored in a storage medium such as RAM, a hard disk, optical disk, or so forth, and is also intended to encompass so-called “firmware” that is software stored on a ROM or so forth. Such software may be organized in various ways, and may include software components organized as libraries, Internet-based programs stored on a remote server or so forth, source code, interpretive code, object code, directly executable code, and so forth. It is contemplated that the software may invoke system-level code or calls to other software residing on a server or other location to perform certain functions.

FIG. 2 illustrates an exemplary method which may be performed with the system of FIG. 1. The method begins at S100.

At S102, the neural network 56 is optionally trained on the set of training images 58 and their labels 59 to predict labels for unlabeled images. The training images may include at least some images of objects that are labeled with labels which are expected to be encountered during the test phase. The training includes learning the weight matrices and tensors of the neural network through backward passes of the neural network. Then, for a new image 14, passed completely through the pre-trained neural network in a forward direction, the neural network outputs label predictions 60 for the set of possible labels. As will be appreciated, even though this is how the neural network is trained, it is not primarily used for label prediction in the exemplary embodiment; rather, the predictions 60 are used to generate a weight gradient-based representation of a new image.

At S104, a new image 14 is received and may be stored in memory 16. The new image 14 may be preprocessed, e.g., resized (reduced in pixel resolution) and/or cropped, or the like, to the same size and shape as the training images 58 before use.

At S106, the trained neural network 56 is used in a forward pass to generate a set of predictions 60 for the image. This includes generating a set of 3D and 1D tensors (forward features) by projection and rectification.

At S108, a set of errors (error vector) 62 is computed based on the difference between each standardized prediction, obtained from vector 63, and the corresponding image prediction, obtained from the vector 60, for each index (class).
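By way of illustration only (not part of the claimed method), a minimal numpy sketch of S108 with P=4 classes and the equal-probability standardized vector 63:

    import numpy as np

    predictions = np.array([0.70, 0.20, 0.05, 0.05])   # vector 60 (P = 4 classes)
    standardized = np.full(4, 1.0 / 4)                 # vector 63, equal-valued
    error = standardized - predictions                 # error vector 62
    print(error)                                       # [-0.45  0.05  0.2   0.2 ]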

At S110, in a backward pass of the trained neural network 56, the gradient of a loss function with respect to the parameters (weight matrices) of the neural network is computed. In particular, weight gradients 66, 68, and/or 70 (gradients of the errors with respect to weight matrices 100, 102, and/or 104) are computed through backpropagation of the errors 62 computed for the image at S108 (FIG. 2). As the backpropagation proceeds from one layer to the previous one, the weight gradient for each of one or more of the fully-connected layers is computed as a function of the quantities back-propagated from the layer above (or of the error vector, in the case of the last layer 104).

At S112, a weight gradient representation 12 of the image is generated, based on one or more of the computed matrices of weight gradients 66, 68, and 70. For example, one or more of these backward features 66, 68, and/or 70 is/are used individually, or combined with each other and/or with one or more forward features to generate a single representation, e.g., through a tensor product operation.

At S114, the representation 12 may be used for similarity computation (retrieval), by the similarity computation component 50. In this embodiment, a measure of similarity is computed between the gradient representation 12 of the image and gradient representations of other images in an image collection. In another embodiment, the representation 12 is used for classification of the image, with the classifier 48. If the representation 12 is to be used for classification, the classifier 48 may have been trained using gradient representations 12 of training images, such as images 58 or a different set of training images (computed as for the new image 14 at S112), and respective true class labels 59.

At S116, information 72 is output which includes or is based on the gradient representation(s) 12 of one or more images (such as a label or label distribution output by the classifier 48, or a set of similar images retrieved by the similarity computation component 50).

The method ends at S118.

The method illustrated in FIG. 2 may be implemented in a computer program product that may be executed on a computer. The computer program product may comprise a non-transitory computer-readable recording medium on which a control program is recorded (stored), such as a disk, hard drive, or the like. Common forms of non-transitory computer-readable media include, for example, floppy disks, flexible disks, hard disks, magnetic tape, or any other magnetic storage medium, CD-ROM, DVD, or any other optical medium, a RAM, a PROM, an EPROM, a FLASH-EPROM, or other memory chip or cartridge, or any other non-transitory medium from which a computer can read and use. The computer program product may be integral with the computer 22 (for example, an internal hard drive or RAM), or may be separate (for example, an external hard drive operatively connected with the computer 22), or may be separate and accessed via a digital data network such as a local area network (LAN) or the Internet (for example, as a redundant array of inexpensive or independent disks (RAID) or other network server storage that is indirectly accessed by the computer 22, via a digital network).

Alternatively, the method may be implemented in transitory media, such as a transmittable carrier wave in which the control program is embodied as a data signal using transmission media, such as acoustic or light waves, such as those generated during radio wave and infrared data communications, and the like.

The exemplary method may be implemented on one or more general purpose computers, special purpose computer(s), a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device, or the like. In general, any device capable of implementing a finite state machine that is in turn capable of implementing the flowchart shown in FIG. 2 can be used to implement the method. As will be appreciated, while the steps of the method may all be computer implemented, in some embodiments one or more of the steps may be at least partially performed manually. As will also be appreciated, the steps of the method need not all proceed in the order illustrated, and fewer, more, or different steps may be performed.

Further details of the system and method will now be provided.

Images 58, 14 may be received by the system 10 in any convenient file format, such as JPEG, GIF, JBIG, BMP, TIFF, or the like, or other common file format used for images, and may optionally be converted to another suitable format prior to processing. Input query images may be stored in memory during processing. Images 14 can be input from any suitable image source 28, such as a workstation, database, memory storage device, such as a disk, image capture device, such as a camera, or the like. In general, each input digital image includes image data for an array of pixels forming the image.

The images 14, 58 may be individual images, such as photographs, video frames, synthetic images, or the like. In one embodiment, each image 14 may be a digital photograph expected to include a region in which an object in one of the classes of interest is expected to be visible as a set of pixels of the image. The image data of the image, which is input to the neural network, may include colorant values for each of the pixels in the image, such as grayscale values for each of a set of color separations, such as L*a*b* or RGB, or be expressed in another color space in which different colors can be represented. In general, “grayscale” refers to the optical density value of any single color channel, however expressed (L*a*b*, RGB, YCbCr, etc.). The exemplary embodiment may also be used for black and white (monochrome) images or for images which have been converted to monochrome for convenient processing.

FIG. 3 schematically illustrates a neural network 56, e.g., in accordance with the AlexNet architecture of Krizhevsky 2012, which is adapted for use in the present system and method. The neural network 56 receives as input an image 14, which in the case of color images is a 3D tensor having, for each pixel, three colorant values (R, G, and B in the illustrated embodiment). In the case of monochrome (black and white) images, the image is a 2D tensor. The supervised layers (operations) of the NN include a sequence 80 of at least one or at least two convolutional layers 82, 84, etc., and a sequence 86 of at least one or at least two fully-connected layers 88, 90, 92. In the case of the NN of Krizhevsky, there are five convolutional layers and three fully-connected layers, although different numbers of these layers may be employed.

Each convolutional layer 82, 84, etc. is parametrized by a tensor 96, 98, etc., having one more dimension than the image, i.e., a 4D tensor in the case of 3D images. Each tensor 96, 98 is a stack of 3D filters. C_k, k=1, . . . , 5 represent the parameters of the 4D tensors 96, 98, etc. of the illustrated convolutional layers.

Each fully-connected layer 88, 90, 92, etc. is parametrized by a weight matrix 100, 102, 104, etc., denoted W_k, where k=6, 7, 8, which are the parameters (matrices) of the fully-connected layers. The stack of fully-connected layers transforms the output activation maps 106 of the convolutional layers 82, 84, etc. into class-membership probabilities 60.

During a forward pass of the NN 56 (in direction A indicated by the bold arrows), the filters 96, 98 are run in a sliding window fashion across the output of the previous layer (or the image itself for the first layer 82) in order to produce a 3D tensor 108, 106, etc., which is a stack of per-filter activation maps. These activation maps may then pass through a non-linear transform (such as a Rectified Linear Unit, or ReLU) and an optional pooling stage before being fed to the next convolutional layer. The ReLU may assign a value of 0 to all negative values in the maps. The activation maps 106 generated by the last convolutional layer 84 in the sequence are flattened, i.e., converted to a 1D tensor (a vector), before being input to the first of the fully-connected layers. Each fully-connected layer performs a simple matrix-vector multiplication followed by a non-linear transform, e.g., ReLU for intermediate layers and a softmax or sigmoid function 110 for the last one. The softmax, sigmoid, or similar non-linear function converts a vector of arbitrary real values to a same-length vector of real values in the range (0, 1).

At each fully-connected layer of the sequence 86, the input vector 106, 112, 114 is converted to an output vector 112, 114, 116, which may have the same or fewer dimensions (or, in some cases, more dimensions). The output 116 of the final fully-connected layer 92 is used to generate the set of predictions 60. Each prediction is a class probability for a respective one of the classes in the set of classes.

The following generalized notations are defined, which are illustrated in FIG. 4. Let x_k be the output of layer k, which is also the input of layer k+1 (for AlexNet, x₅ is the flattened activation map of the fifth convolutional layer). Layer k is parametrized by the 4D tensor C_k if it is a convolutional layer, and by the matrix W_k for a fully-connected layer. A fully-connected layer performs the operation x_k = σ(W_k^T x_{k-1}), where σ is the non-linear function (ReLU for intermediate layers, softmax or similar for the last one). Let y_k = W_k^T x_{k-1} be the output of layer k before the non-linearity, and θ = {C₁, . . . , C_M} ∪ {W_{M+1}, . . . , W_L} the parameters of all L layers of the network 56.
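For illustration, a minimal numpy sketch of this fully-connected forward pass follows; the layer sizes and random weights are toy stand-ins (the real AlexNet matrices W₆, W₇, W₈ are 9216×4096, 4096×4096, and 4096×1000), not trained parameters:

    import numpy as np

    def relu(y):
        # sigma for intermediate layers
        return np.maximum(y, 0.0)

    def softmax(y):
        # sigma for the last layer: maps scores to probabilities in (0, 1)
        e = np.exp(y - y.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    dims = [32, 16, 16, 10]              # toy sizes for x_5 and layers 6-8
    W = [rng.standard_normal((dims[k], dims[k + 1])) for k in range(3)]

    x = rng.standard_normal(dims[0])     # x_5: flattened activation map
    for k, W_k in enumerate(W):
        y_k = W_k.T @ x                  # y_k = W_k^T x_{k-1}
        sigma = softmax if k == len(W) - 1 else relu
        x = sigma(y_k)                   # x_k = sigma(y_k)
    # x now holds the class-membership probabilities (vector 60)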

For example, in the NN 56 illustrated in FIG. 3, the image 14 (and similarly each training image) is a 3D tensor (suppose that it includes about 10,000 pixels in each of the 3 colors). In the first convolutional layer, the tensor 96 includes 96 filters, each of size 11×11 pixels for each of the 3 colors, which are used as a sliding window to extract and filter 55×55=3025 overlapping windows from the image. The result of these filters on the input image 14 is a 3D tensor 108 of dimensions 96×55×55 in the illustrative embodiment, i.e., 96 activation maps of size 55×55. As the sequence progresses, the number of “colors” (activation maps) in the output increases and the number of pixels for each color decreases. The output of the last convolutional layer 84 is a 256×6×6 tensor that is easily converted to a 9216-dimensional vector by concatenation. When multiplied by a 9216×4096 matrix 100 of weights, the output is a 4096-dimensional vector 112.

Training the Neural Network (S102)

Training the NN 56 includes end-to-end learning of the vast number of parameters θ 96, 98, 100, 102, 104 via the minimization of an error (or loss) function on a large training set of N images 58 and their ground-truth labels 59 (i.e., pairs $(I^{(i)}, g^{(i)})$). During training, the set of predictions 60 output by the NN for each training image 58 is compared with its true label 59 (which is typically a vector of zeros, except for a 1 corresponding to the index of the true label). This results in an error vector analogous to vector 62, which is back-propagated through each of the weight matrices and 4D tensors in turn to update them.

An exemplary loss function used for classification is the cross-entropy:

$E\left( I^{(i)}, g^{(i)}; \theta \right) = -\sum_{c=1}^{P} g_{c}^{(i)} \log\left( x_{L,c}^{(i)} \right) \quad (1)$

where P is the number of labels (categories), $g^{(i)} \in \{0,1\}^{P}$ is the label vector of image $I^{(i)}$, and $x_{L,c}^{(i)}$ is the predicted probability of class c for image $I^{(i)}$, i.e., the c-th element of the output $x_{L}^{(i)}$ 60 of the last layer L resulting from the forward pass on image $I^{(i)}$.
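As a concrete sketch (illustrative values, not from the experiments), Equation (1) for a single image reduces to:

    import numpy as np

    def cross_entropy(g, x_L):
        # E(I, g; theta) = -sum_c g_c log(x_{L,c})   -- Equation (1)
        return -np.sum(g * np.log(x_L))

    g = np.array([0.0, 0.0, 1.0, 0.0])      # one-hot true label, P = 4
    x_L = np.array([0.1, 0.2, 0.6, 0.1])    # predicted probabilities
    print(cross_entropy(g, x_L))            # -log(0.6), approx. 0.511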

The optimal network parameters θ* are the ones minimizing this loss over the training set 58. This can be expressed as:

$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} E\left( I^{(i)}, g^{(i)}; \theta \right) \quad (2)$

This optimization problem can be solved using Stochastic Gradient Descent (SGD), a stochastic approximation of batch gradient descent which entails performing approximate gradient steps equal, on average, to the true gradient ∇_θE. See, for example, Krizhevsky 2012. Each approximate gradient step may be performed with a small batch of labeled examples in order to leverage the caching and vectorization mechanisms of the computer system efficiently.
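The structure of such an SGD loop can be sketched as follows; a least-squares toy problem stands in for the ConvNet loss of Equation (2), since only the looping and batching pattern is of interest here:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 5))          # toy training set
    w_true = rng.standard_normal(5)
    t = X @ w_true                              # toy ground truth

    theta = np.zeros(5)
    lr, batch = 0.1, 32
    for step in range(300):
        idx = rng.integers(0, len(X), size=batch)   # small labeled batch
        err = X[idx] @ theta - t[idx]
        grad = X[idx].T @ err / batch               # approximates the true gradient
        theta -= lr * grad                          # approximate gradient step
    print(np.linalg.norm(theta - w_true))           # should be near 0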

A particularity of deep networks is that the gradients with respect to all parameters θ can be computed efficiently in a stage-wise fashion via sequential application of the chain rule, a technique termed “back-propagation.” See, e.g., D. E. Rumelhart, et al., “Learning representations by back-propagating errors,” Nature 323, pp. 533-536, 1986. The training of the network (i.e., obtaining θ*) can be performed via SGD with back-propagation on a large labeled dataset 58, such as ImageNet (J. Deng, et al., “ImageNet: A Large-Scale Hierarchical Image Database,” CVPR, 2009). In conventional classification methods, the learned ConvNet parameters are used for labeling using a forward pass on the pre-trained network. In the present application, back-propagation is used again, at test time, to derive representations that are based on the gradient of the loss with respect to the ConvNet parameters of the fully-connected layers.

Gradient Derivation (S110)

ConvNets and other feed-forward architectures are differentiable through all their layers. In the case of ConvNets, the gradients of the loss (error) with respect to the weights of the fully-connected layers 88, 90, 92 can be computed as follows:

$\frac{\partial E}{\partial W_{k}} = x_{k-1}\left\lbrack \frac{\partial E}{\partial y_{k}} \right\rbrack^{T} \quad (3)$

using the notations shown in FIG. 4; i.e., the gradient matrix for a given layer k is a function of the output of the previous layer and of an error term, the gradient of the loss with respect to the layer's pre-nonlinearity output, which is itself computed recursively from the next layer.

To compute the partial derivatives of the loss with respect to the intermediate outputs y_k used in Equation (3), the chain rule can be applied. In the case of fully-connected layers and ReLU non-linearities, this leads to the following recursive definition:

$\frac{\partial E}{\partial y_{k}} = \left\lbrack W_{k+1}\frac{\partial E}{\partial y_{k+1}} \right\rbrack \circ \Pi_{\lbrack y_{k} > 0 \rbrack} \quad (4)$

where $\Pi_{\lbrack y_{k} > 0 \rbrack}$ is an indicator vector, set to one at the positions where y_k > 0 and to zero otherwise, and ∘ is the Hadamard or element-wise product. The Hadamard product can be rewritten in a compact form as a matrix multiplication, allowing the derivative to be computed as:

$\frac{\partial E}{\partial y_{k}} = Y_{k}\left\lbrack W_{k+1}\frac{\partial E}{\partial y_{k+1}} \right\rbrack \quad (5)$

where Y_k is a square diagonal matrix constructed with the elements of $\Pi_{\lbrack y_{k} > 0 \rbrack}$.

Thus, for example, the gradient features forming the gradient matrix 68,

$\frac{\partial E}{\partial W_{7}},$

are computed as a function of the forward features x₆ used as input to the layer 90, the weight matrix W₈ 104, and the gradient 70 of the error with respect to the output of the next layer 92,

$\frac{\partial E}{\partial y_{8}},$

which can be decomposed recursively:

$\frac{\partial E}{\partial W_{7}} = x_{6}\left\lbrack Y_{7}\left\lbrack W_{8}\frac{\partial E}{\partial y_{8}} \right\rbrack \right\rbrack^{T}.$

The computing of the gradient of the loss function thus entails computing and extracting a set of forward features (e.g., x₆) from the neural network, computing and extracting a set of backward features from the neural network, the backward features comprising a gradient of the loss with respect to the output of a selected layer of the neural network computed in a backward pass of the neural network (e.g.,

$\frac{\partial E}{\partial y_{8}}),$

and combining the forward and the backward features to construct a set of gradient features (e.g.,

$\frac{\partial E}{\partial W_{7}}),$

the gradient features comprising a gradient of the loss with respect to the parameters W₇ of a selected layer of the neural network. For the last layer, the derivation of the gradient with respect to y_L, i.e., the error 62, is straightforward and gives:

$\frac{\partial E}{\partial y_{L}} = g_{s} - \sigma\left( y_{L} \right) \quad (6)$

where g_s is the standardized vector of labels 63 used to compute the loss. In the exemplary embodiment, g_s is an equal-valued vector, g_s = [1/P, . . . , 1/P], i.e., it is assumed that all classes have equal probabilities. It can be seen that the gradient of the last layer,

$\frac{\partial E}{\partial y_{L}}$

is simply a shifted version of the output probabilities 60, while the derivatives 70, 68, 66 w.r.t. y_k with k<L are linear transformations of these shifted probabilities.

The gradient features of any one or more of the gradient matrices 66, 68, 70 can be used to generate the representation 12.
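For illustration only, the following numpy sketch ties Equations (3)-(6) together on toy dimensions; Y_k is applied as an element-wise mask rather than an explicit diagonal matrix, and the random weights are stand-ins for the trained layers 6-8:

    import numpy as np

    rng = np.random.default_rng(0)
    dims = [32, 16, 16, 10]                 # toy sizes for x_5 and layers 6-8
    W = [rng.standard_normal((dims[k], dims[k + 1])) for k in range(3)]

    # Forward pass, keeping x_{k-1} and y_k for every layer (cf. FIG. 4).
    xs = [rng.standard_normal(dims[0])]     # x_5
    ys = []
    for k, W_k in enumerate(W):
        y = W_k.T @ xs[-1]                  # y_k = W_k^T x_{k-1}
        ys.append(y)
        if k < len(W) - 1:
            xs.append(np.maximum(y, 0.0))   # ReLU
        else:
            e = np.exp(y - y.max())
            xs.append(e / e.sum())          # softmax probabilities (vector 60)

    # Backward pass.
    P = dims[-1]
    g_s = np.full(P, 1.0 / P)               # standardized label vector 63
    dE_dy = [None] * len(W)
    dE_dy[-1] = g_s - xs[-1]                # Equation (6)
    for k in range(len(W) - 2, -1, -1):
        # Equation (4): back-propagate through W_{k+1}, mask where y_k <= 0
        dE_dy[k] = (W[k + 1] @ dE_dy[k + 1]) * (ys[k] > 0)

    # Equation (3): each dE/dW_k is the rank-1 matrix x_{k-1} [dE/dy_k]^T
    dE_dW = [np.outer(xs[k], dE_dy[k]) for k in range(len(W))]
    print([g.shape for g in dE_dW])         # same shapes as W_6, W_7, W_8

Any one of the resulting matrices in dE_dW (or a combination of them) then plays the role of the gradient representation 12.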

Computing Similarities Between Gradients (S114)

The exemplary gradient matrices 66, 68, 70 are very high-dimensional. In the case of the AlexNet architecture, for example, there are about 4 million dimensions for

$\frac{\partial E}{\partial W_{8}},$

about 16 million for

$\frac{\partial E}{\partial W_{7}}$

and about 32 million for

$\frac{\partial E}{\partial W_{6}}.$

In this case, computing the gradients explicitly and using the dot-product as a similarity measure between them may not be computationally feasible. As an alternative, the structure of the gradient matrices (rank-1 matrices) can be leveraged by using the trace kernel as a similarity measure. The trace kernel between two matrices A and B is defined as:

$K_{tr}(A,B) = \mathrm{Tr}\left( A^{T}B \right) \quad (7)$

It can be shown that, for rank-1 matrices, the trace can be decomposed as the product of two sub-kernels. Let $A = au^{T}$, $A \in \mathbb{R}^{d \times D}$, and $B = bv^{T}$, $B \in \mathbb{R}^{d \times D}$, with $a, b \in \mathbb{R}^{d}$ and $u, v \in \mathbb{R}^{D}$; then:

$\begin{aligned} K_{tr}(A,B) &= \mathrm{Tr}\left( \left( au^{T} \right)^{T} bv^{T} \right) \\ &= \mathrm{Tr}\left( u a^{T} b v^{T} \right) \\ &= \left( a^{T}b \right) \mathrm{Tr}\left( u v^{T} \right) \\ &= \left( a^{T}b \right) \cdot \left( u^{T}v \right). \end{aligned} \quad (8)$
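This identity is easy to check numerically; a short sketch with arbitrary dimensions d=6 and D=4:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.standard_normal(6), rng.standard_normal(6)
    u, v = rng.standard_normal(4), rng.standard_normal(4)
    A, B = np.outer(a, u), np.outer(b, v)    # rank-1 matrices a u^T and b v^T

    lhs = np.trace(A.T @ B)                  # K_tr(A, B), Equation (7)
    rhs = (a @ b) * (u @ v)                  # decomposed form, Equation (8)
    print(np.isclose(lhs, rhs))              # True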

Therefore, for two images A and B, the similarity between gradients can be computed in a low-dimensional space, without explicitly computing the gradients with respect to the weights, as follows:

$\begin{matrix}{{K_{tr}\left( {\frac{\partial E^{(A)}}{\partial W_{k}},\frac{\partial E^{(B)}}{\partial W_{k}}} \right)} = {\left( {\left( x_{k - 1}^{(A)} \right)^{T}\left( x_{k - 1}^{(B)} \right)} \right) \cdot {\left( {\left\lbrack \frac{\partial E^{(A)}}{\partial y_{k}} \right\rbrack^{T}\left\lbrack \frac{\partial E^{(B)}}{\partial y_{k}} \right\rbrack} \right).}}} & (9)\end{matrix}$

In Equation (9), the left part of the equation indicates that the forward activations of the two inputs, $x_{k-1}^{(A)}$ and $x_{k-1}^{(B)}$, should be similar. This is a standard measure of similarity which is used between images when described by the outputs of the intermediate layers of ConvNets. However, in the present method, this similarity is multiplicatively weighted by the similarity between the respective backpropagations

$\left\lbrack \frac{\partial E^{(A)}}{\partial y_{k}} \right\rbrack$ and $\left\lbrack \frac{\partial E^{(B)}}{\partial y_{k}} \right\rbrack.$

This indicates that, to obtain a high similarity value with the exemplary kernels, not only do the target forward activations need to be similar, but the backpropagation quantities also need to be similar, i.e., the forward activations need to have been generated in a similar manner. Two inputs that produce a high score on one activation layer but have been generated using different activations will not yield a large similarity score.

It has also been found that l₂-normalizing the activation features (i.e., using a cosine similarity instead of the dot-product) consistently leads to superior results. This is consistent with using a normalized trace kernel, since $\| au^{T} \|_{F} = \| a \|_{2}\| u \|_{2}$. In that case, the similarity may be computed as:

$\begin{matrix}{{{K_{tr}\left( {\frac{\frac{\partial E^{(A)}}{\partial W_{k}}}{{\frac{\partial E^{(A)}}{\partial W_{k}}}_{F}},\frac{\frac{\partial E^{(B)}}{\partial W_{k}}}{{\frac{\partial E^{(B)}}{\partial W_{k}}}_{F}}} \right)} = {\frac{\left( x_{k - 1}^{(A)} \right)^{T}\left( x_{k - 1}^{(B)} \right)}{{x_{k - 1}^{(A)}}_{2}{x_{k - 1}^{(B)}}_{2}} \cdot \frac{\left\lbrack \frac{\partial E^{(A)}}{\partial y_{k}} \right\rbrack^{T}\left\lbrack \frac{\partial E^{(B)}}{\partial y_{k}} \right\rbrack}{{\frac{\partial E^{(A)}}{\partial y_{k}}}_{2}{\frac{\partial E^{(B)}}{\partial y_{k}}}_{2}}}},} & (10)\end{matrix}$

where the l₂ normalization of the individual activations is made explicit in the formulation.
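A sketch of this normalized similarity, assuming the forward features x_{k-1} and backward features ∂E/∂y_k have already been extracted for two images A and B (random vectors stand in for them here); note that the rank-1 gradient matrices are never materialized:

    import numpy as np

    def gradient_similarity(x_A, dy_A, x_B, dy_B):
        # Equation (10): cosine similarity of the forward activations times
        # cosine similarity of the backward quantities.
        fwd = (x_A @ x_B) / (np.linalg.norm(x_A) * np.linalg.norm(x_B))
        bwd = (dy_A @ dy_B) / (np.linalg.norm(dy_A) * np.linalg.norm(dy_B))
        return fwd * bwd

    rng = np.random.default_rng(0)
    x_A, x_B = rng.standard_normal(4096), rng.standard_normal(4096)    # x_{k-1}
    dy_A, dy_B = rng.standard_normal(4096), rng.standard_normal(4096)  # dE/dy_k
    print(gradient_similarity(x_A, dy_A, x_B, dy_B))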

Classification Using the Gradient Features (S114)

The exemplary representations 12, generated as described above, can additionally or alternatively be used for classifying images using the classifier 48. In particular, a classifier model is trained on the image representations 12 of labeled training samples (such as images 58) and their respective ground truth labels 59. Any suitable classifier learning may be employed, such as support vector machines, for learning a linear classifier. A binary classifier model can be learned for each class, or a multi-class classifier learned for all classes. Once the classifier model has been learned, it can be used to assign a class to an input image, such as image 14, based on the representation 12.

Comparison of Gradient Features and Fisher Kernels

There are some similarities between the Fisher Kernel (FK) and the present method using weight-gradient-based representations from ConvNets.

The Fisher Kernel is a generic principle introduced to combine the benefits of generative and discriminative models for pattern recognition. Let X be a sample (an image in this case), and let u_θ be a probability density function that models the generative process of X, where θ denotes the vector of parameters of u_θ. In statistics, the score function is given by the gradient of the log-likelihood of the data on the model:

$G_{\theta}^{X} = \nabla_{\theta} \log u_{\theta}(X). \quad (11)$

This gradient describes the contribution of the individual parameters to the generative process. Jaakkola, et al., “Exploiting generative models in discriminative classifiers,” NIPS, 1998, hereinafter, “Jaakkola 1998,” proposed to measure the similarity between two samples X and Y using the Fisher Kernel (FK), which is defined as:

$K_{FK}(X,Y) = {G_{\theta}^{X}}^{\prime} F_{\theta}^{-1} G_{\theta}^{Y} \quad (12)$

where $F_{\theta}$ is the Fisher Information Matrix (FIM):

$F_{\theta} = E_{X \sim u_{\theta}}\left\lbrack G_{\theta}^{X} {G_{\theta}^{X}}^{\prime} \right\rbrack. \quad (13)$

The FIM can be approximated by the identity matrix.

Some extensions make the dependence of the kernel on the classification labels explicit. For example, the likelihood kernel of Fine involves one generative model per class, and entails computing one FK for each generative model (and consequently for each class). See, Fine, et al., “A Hybrid GMM/SVM approach to speaker identification,” ICASSP, 2001. The likelihood ratio kernel of Smith and Gales is tailored to the two-class problem and involves computing the gradient of the log-likelihood of the ratio between the two class likelihoods. See, Smith, et al., “Speech recognition using SVMs,” NIPS, 2001, hereinafter, “Smith 2001.” Given two classes denoted c₁ and c₂, with class-conditional probability density functions p(·|c₁) and p(·|c₂) and with collective parameters θ, this yields:

$\begin{matrix}{{\nabla_{\theta}\log}{\frac{p\left( {Xc_{1}} \right)}{p\left( {Xc_{2}} \right)}.}} & (14)\end{matrix}$

The exemplary gradient features derived from deep nets show some relationship to the likelihood ratio kernel in Equation (14). As the standard ConvNet classification architecture does not define a generative model, the FK framework cannot be applied as is. It may be noted that Equation (14) can be rewritten as the gradient of the log-likelihood of the ratio between posterior probabilities (assuming equal class priors):

$\begin{matrix}{{{\nabla_{\theta}\log}\frac{p\left( {Xc_{1}} \right)}{p\left( {Xc_{2}} \right)}} = {{\nabla_{\theta}\log}\; \frac{p\left( {c_{1}X} \right)}{p\left( {c_{2}X} \right)}}} & (15)\end{matrix}$

In the two-class problem of Smith 2001, p(c₂|X) = 1 − p(c₁|X), and the equation can be rewritten as:

$\begin{matrix}{{{\nabla_{\theta}\log}\frac{p\left( {Xc_{1}} \right)}{p\left( {Xc_{2}} \right)}} \propto {{\nabla_{\theta}\log}\; {p\left( {c_{1}X} \right)}}} & (16)\end{matrix}$

This shows that the likelihood ratio kernel in the two-class case is proportional to the gradient of the log-posterior probability of the class in the numerator. To extend this representation beyond the two-class case, the gradient of the posterior probability of each class is computed, which results in one gradient vector per class. These gradient vectors can be aggregated by concatenating them or by averaging the gradient vectors corresponding to the different classes, i.e.:

$\begin{matrix}{{\nabla_{\theta}\Sigma_{i = 1}^{P}}\frac{1}{P}\log \; {{p\left( {c_{i}X} \right)}.}} & (17)\end{matrix}$

This is similar to Equation (1), where the gradient of the (minus) cross-entropy is computed between the output of the network and an uninformative uniform ground-truth label: g = [1/P, . . . , 1/P]. However, the present method generates a representation which can produce superior results, as illustrated in the Examples below.
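To make the connection explicit: substituting the uniform vector g = [1/P, . . . , 1/P] into Equation (1) and writing the network output $x_{L,c}$ as the posterior $p(c \mid X)$ gives

$-E(I; \theta) = \sum_{c=1}^{P} \frac{1}{P} \log x_{L,c} = \sum_{c=1}^{P} \frac{1}{P} \log p(c \mid X),$

so the gradient of the minus cross-entropy with respect to θ coincides with the averaged gradient of the log-posteriors in Equation (17).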

EXAMPLES

Datasets and Evaluation Protocols

The exemplary method was evaluated on two standard classification benchmarks, Pascal VOC 2007 and Pascal VOC 2012. These datasets contain 9,963 and 22,531 annotated images, respectively. Each image is annotated with one or more labels from a set of labels corresponding to 20 object categories. The datasets include well-defined partitions for training, validating, and testing, and the accuracy is measured in terms of the standard mean Average Precision (mAP). The test annotations of VOC 2012 are not public, but an evaluation server with a limited number of submissions per week is available. Therefore, the validation set is used for the first part of the analysis on the VOC 2012 dataset, and evaluation is performed on the test set only for the final experiments.

Two different neural network architectures were evaluated: AlexNet (Krizhevsky 2012) and VGG16 (Simonyan 2014). Publicly available versions of these neural networks that were previously trained on the ILSVRC2012 subset of ImageNet were used (github.com/BVLC/caffe/wiki/Model-Zoo).

To extract descriptors from the Pascal VOC images, they were first resized so that the shortest side had 227 pixels (224 in the VGG16 case), and then a central square crop was taken, without distorting the aspect ratio. This cropping technique was found to work well in practice. Data augmentation (either multiple crops or mirroring) was not used for the training and testing images. Feature extraction is performed with a customized version of the Caffe library (http://caffe.berkeleyvision.org), modified to expose the backpropagation features. This allows extracting forward and backward features of the training and testing Pascal images on different layers. As discussed above, a uniform set of labels 63 was used for the backward pass to extract the gradient features. All forward and backward features are then l₂-normalized.

To extract features from the Pascal images, the softmax of the ConvNet was replaced with a tempered version parametrized by temperature τ:

$\begin{matrix}{{{\sigma \left( {y,\tau} \right)}_{d} = \frac{\exp \left( {y_{d}/\tau} \right)}{\Sigma_{i}{\exp \left( {y_{i}/\tau} \right)}}},} & (18)\end{matrix}$

which reduces to the standard softmax when τ equals one, and provides softer probability distributions as τ increases. This was found to produce better results than using the standard softmax, either as a feature by itself or to construct the gradient features. This is likely due to the saturating nature of the softmax and the change of domain (training on ImageNet, testing on Pascal VOC). Slightly increasing the temperature to τ=2 or τ=4 noticeably improved the results. For all the experiments, τ was set to 2. As will be appreciated, the tempered softmax is used only to extract the features of the Pascal images used in testing. The ConvNet is trained on ImageNet using the standard softmax.
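A sketch of the tempered softmax of Equation (18) (the max-subtraction is a standard numerical-stability device and leaves the value unchanged):

    import numpy as np

    def tempered_softmax(y, tau=2.0):
        # Equation (18): reduces to the standard softmax at tau = 1;
        # larger tau yields softer, less saturated probabilities.
        e = np.exp((y - y.max()) / tau)
        return e / e.sum()

    y = np.array([4.0, 1.0, 0.0])
    print(tempered_softmax(y, tau=1.0))   # sharp, near-saturated
    print(tempered_softmax(y, tau=2.0))   # noticeably softer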

To perform the classification, the SVM implementation provided in scikit-learn (scikit-learn.org) was used. A custom trace kernel is used for the similarity between gradient representations, as described above. The Gram matrix was explicitly constructed, and the SVM objective function (which uses the Gram matrix) is optimized in its dual form in all cases (not just in the trace kernel case) to avoid discrepancies between different solvers. The cost parameter C of the solver was set to the default value of 1, which worked well in practice.
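A sketch of this setup with scikit-learn, where random feature pairs stand in for the extracted forward/backward features; SVC with kernel='precomputed' takes the train-by-train Gram matrix at fit time and the test-by-train matrix at prediction time:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Stand-ins for per-image forward (x_{k-1}) and backward (dE/dy_k) features.
    F_tr, B_tr = rng.standard_normal((100, 64)), rng.standard_normal((100, 64))
    y_tr = rng.integers(0, 2, size=100)          # binary labels

    def trace_gram(Fa, Ba, Fb, Bb):
        # Pairwise Equation (10): product of the two cosine sub-kernels.
        unit = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
        return (unit(Fa) @ unit(Fb).T) * (unit(Ba) @ unit(Bb).T)

    K_train = trace_gram(F_tr, B_tr, F_tr, B_tr)
    clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_tr)

    F_te, B_te = rng.standard_normal((10, 64)), rng.standard_normal((10, 64))
    K_test = trace_gram(F_te, B_te, F_tr, B_tr)  # test x train
    print(clf.predict(K_test))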

Several features for each dataset and network architecture were extracted and compared:

1. Individual forward activation features, for Pool5 (the output 106 of the last convolutional layer), each of the fully-connected (FC) layers 88, 90, 92, and the final probabilities 60.

2. Concatenation of forward activation features, such as Pool5+FC6, FC6+FC7, or FC7+FC8.

3. Gradient features:

$\frac{\partial E}{\partial W_{6}}, \frac{\partial E}{\partial W_{7}}, \text{ and } \frac{\partial E}{\partial W_{8}}.$

The similarity between the l₂-normalized forward activation features is measured with the dot-product, while the similarity between gradient representations is measured using the trace kernel. Table 1 summarizes the results. The results on VOC 2012 are on the validation set. Tables 2 and 3 show per-class accuracies for two of the features.

TABLE 1
Results on Pascal VOC 2007 and VOC 2012 with AlexNet and VGG16 features

                                   VOC2007            VOC2012
Features                        AlexNet  VGG16     AlexNet  VGG16
x₅ (Pool5)                        71.0    86.7       66.1    81.4
x₆ (FC6)                          77.1    89.3       72.6    84.4
x₇ (FC7)                          79.4    89.4       74.9    84.6
y₈ (FC8)                          79.1    88.3       74.3    84.1
x₈ (Predictions)                  76.2    86.0       71.9    81.3
x₅;x₆ (concatenation)             76.4    89.2       71.6    84.0
∂E/∂W₆ = x₅[∂E/∂y₆]ᵀ              80.2    89.3       75.1    84.6
x₆;x₇ (concatenation)             79.1    89.5       74.3    84.6
∂E/∂W₇ = x₆[∂E/∂y₇]ᵀ              80.9    90.0       76.3    85.2
x₇;x₈ (concatenation)             79.7    89.2       75.3    84.6
∂E/∂W₈ = x₇[∂E/∂y₈]ᵀ              79.7    88.2       75.0    83.4

TABLE 2
Pascal VOC2007 with AlexNet and VGG16

                     AlexNet             VGG16
Features        x₇ (FC7)  ∂E/∂W₇    x₇ (FC7)  ∂E/∂W₇
mean              79.4     80.9       89.3     90.0
aeroplane         95.4     96.6       99.2     99.6
bicycle           88.6     89.2       95.9     97.2
bird              92.6     93.8       99.1     98.8
boat              87.3     89.5       96.9     97.0
bottle            42.1     44.9       63.8     63.3
bus               80.1     81.0       92.8     93.8
car               90.5     91.9       95.1     95.6
cat               89.6     89.9       98.1     98.4
chair             59.9     61.2       70.4     71.1
cow               68.2     70.4       87.8     89.4
dining table      74.1     78.5       84.3     85.3
dog               85.3     86.2       97.0     97.7
horse             89.8     91.4       97.2     97.7
motorbike         85.6     87.4       93.5     95.6
person            95.3     95.7       97.3     97.5
potted plant      58.1     60.5       68.6     70.3
sheep             78.9     78.8       92.2     92.7
sofa              57.9     62.5       73.3     76.2
train             94.7     95.2       98.7     98.8
tv monitor        74.4     73.5       85.5     84.2

TABLE 3
Pascal VOC2012 with AlexNet and VGG16

                AlexNet (validation)   AlexNet (test)     VGG16 (validation)    VGG16 (test)
Features         x₇ (FC7)  ∂E/∂W₇    x₇ (FC7)  ∂E/∂W₇    x₇ (FC7)  ∂E/∂W₇    x₇ (FC7)  ∂E/∂W₇
mean               74.9     76.3       75.0     76.5       84.6     85.2       85.0     85.3
aeroplane          92.9     94.3       93.8     95.0       98.2     98.6       97.8     98.0
bicycle            75.4     77.4       75.0     76.6       88.3     89.4       85.2     86.0
bird               88.7     89.5       86.4     87.7       94.6     94.7       92.3     91.7
boat               81.7     82.2       82.2     82.9       90.5     91.5       91.1     91.3
bottle             48.0     50.8       48.2     52.5       66.0     67.2       64.5     65.7
bus                89.0     90.2       82.5     83.4       93.6     94.0       89.7     89.6
car                70.3     72.4       73.8     75.6       80.5     80.9       82.2     82.4
cat                88.0     89.3       87.6     88.6       96.4     96.8       95.4     95.5
chair              62.3     64.8       63.8     65.3       73.9     73.7       74.1     74.5
cow                63.6     63.9       63.5     65.4       81.3     83.7       84.7     84.2
dining table       57.8     60.3       69.3     69.8       70.2     71.9       81.1     80.7
dog                83.5     84.0       85.7     86.5       93.0     93.4       94.1     94.3
horse              78.0     79.6       80.3     82.1       91.3     91.6       93.5     93.7
motorbike          82.9     84.0       84.1     85.1       91.3     91.5       91.9     92.2
person             92.9     93.2       92.3     93.0       95.1     95.4       95.0     95.4
potted plant       49.1     50.6       47.4     48.2       56.3     56.0       57.9     57.7
sheep              74.8     76.7       72.2     74.5       87.7     88.3       86.0     87.2
sofa               50.5     52.6       51.8     57.0       64.2     65.2       67.8     69.2
train              90.2     91.8       88.1     88.4       95.8     95.5       95.2     95.2
tv monitor         78.7     79.2       72.5     73.0       84.5     85.2       81.5     81.4

The results indicate the following:

Forward Activations:

In all cases, FC7 is the best performing individual layer on both VOC2007 and VOC2012, independently of the network. The probability layer performs badly in this case. Concatenating forward layers does not seem to bring any noticeable accuracy improvements in any setup.

Gradient Representations:

The gradient representations are compared with the concatenation of forward activations, since they are related and share part of the features. On the deeper layers (6 and 7), the gradient representations outperform the individual features as well as the concatenation, both for AlexNet and VGG16, on both datasets. For AlexNet, the improvements are quite significant: +3.8% and +3.5% absolute improvement for the gradients with respect to W₆ on VOC2007 and VOC2012, and +1.8% and +2% for W₇. The improvements for VGG16 are more modest but still noticeable: +0.1% and +0.6% for the gradients with respect to W₆, and +0.5% and +0.6% for the gradients with respect to W₇. Larger relative improvements on less discriminative networks such as AlexNet seem to suggest that the more complex gradient representation can, to some extent, compensate for the lack of discriminative power of the network, but that diminishing returns are obtained as the power of the network increases.

For the last of the layers 92 (layer 8), the gradient representations do not perform as well. This is not surprising, since the derivative with respect to W₈ depends heavily on the output of the probability layer (see Equation (6)), which is susceptible to saturation, hence the results. However, the derivatives with respect to W₆ and W₇ involve more information, leading to superior results.

Per-Class Results:

For the per-class results in Tables 2 and 3, the best forward features (the individual FC7) are compared with the best gradient representation

$\left( \frac{\partial E}{\partial W_{7}} \right).$

On both networks and datasets, the results are consistently better for the gradient representations. For AlexNet, the gradient representation is equal to or better in performance than FC7 on 18 out of the 20 classes on VOC2007, and on all classes for VOC2012. For VGG16, the gradient representation performs equal to or better on 17 out of the 20 classes, both on VOC2007 and VOC2012.

Comparison with Other Representations

The best results are compared with the following types of representation on PASCAL VOC2007 and VOC2012 in Table 4:

TABLE 4
Comparison with other ConvNet results on PASCAL VOC'07 and VOC'12 (mAP in %)

Method                                       VOC'07   VOC'12
Gradient - based on AlexNet                   80.9     76.5
Gradient - based on VGG16                     90.0     85.2
DeCAF (Donahue 2014, from Chatfield 2014)     73.4      —
Razavian (Razavian 2014)                      77.2      —
Oquab (Oquab 2014)                            77.7     78.7
Zeiler (Zeiler 2014)                           —       79.0
Chatfield (Chatfield 2014)                    82.4     83.2
He *                                          80.1      —
Wei **                                        81.5     81.7
Simonyan (Simonyan 2014)                      89.7     89.3

* He, et al., “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition,” ECCV, 2014.
** Wei, et al., “CNN: Single-label to Multi-label,” arXiv, 2014.

It can be seen that competitive performance is obtained on both datasets. The results with VGG16 are somewhat inferior to those reported in Simonyan 2014 with a similar model. This may be due to the more costly feature extraction strategy employed by Simonyan, which involves aggregating image descriptors at multiple scales.

In further experiments, improved results were obtained with the present method by adjusting the parameters of the ConvNet.

It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.

What is claimed is:
1. A method for extracting a representation from an image, comprising: inputting an image to a pre-trained neural network; computing a gradient of a loss function with respect to parameters of the neural network for the image; and extracting a gradient representation of the image based on the computed gradients, wherein at least one of the computing and the extracting is performed with a processor.

2. The method of claim 1, wherein the computing of the gradient of the loss function comprises: computing and extracting a set of forward features from the neural network; computing and extracting a set of backward features from the neural network, the backward features comprising a gradient of the loss with respect to the output of a selected layer of the neural network computed in a backward pass of the neural network; and combining the forward and the backward features to construct a set of gradient features, the gradient features comprising a gradient of the loss with respect to the parameters of a selected layer of the neural network.

3. The method of claim 1, wherein the forward and backward features are combined by matrix multiplication.

4. The method of claim 1, wherein in a forward pass of the neural network, a prediction vector of values is output for the image which includes a prediction value for each of a set of classes, the computing of the gradient of a loss function comprising computing a vector of error values based on a difference between the prediction vector and a standard vector comprising a standard value for each of the classes, and backpropagating the error values through at least one of the layers of the neural network.

5. The method of claim 1, wherein the parameters of the neural network comprise parameters of at least one of the fully-connected layers of the neural network.

6. The method of claim 1, further comprising outputting at least one of: the gradient representation of the image, and information based thereon.

7. The method of claim 6, wherein the information comprises a computed similarity between the input image and another image, based on respective gradient representations of the image and the other image.

8. The method of claim 7, wherein the method includes computing the similarity between the two images, comprising: computing a forward similarity between forward representations of the two images extracted in a forward pass of the neural network; computing a backward similarity between backward representations of the two images; and computing the similarity by combining the forward and backward similarities.

9. The method of claim 8, wherein the combining of the forward and backward similarities is performed by multiplication.

10. The method of claim 7, further comprising retrieving a set of similar images based on the computed similarity between the input image and each of a collection of images.

11. The method of claim 6, wherein the gradient representation is classified with a classifier trained on gradient representations of labeled images and the information is based on the classifier classification.

12. The method of claim 1, wherein the neural network is a convolutional network.

13. The method of claim 1, wherein the gradient of the loss function is computed with respect to the weights of at least one of the fully-connected layers according to: $\frac{\partial E}{\partial W_{k}} = x_{k-1}\left\lbrack \frac{\partial E}{\partial y_{k}} \right\rbrack^{T}.$

14. The method of claim 1, wherein the neural network has been pre-trained to predict labels for an image, the training having been performed with a set of labeled training images.

15. A computer program product comprising a non-transitory recording medium storing instructions, which when executed on a computer, cause the computer to perform the method of claim 1.

16. A system comprising memory which stores instructions for performing the method of claim 1 and a processor in communication with the memory for executing the instructions.

17. A system for extracting a representation from an image, comprising: memory which stores a pre-trained neural network; a prediction component for predicting labels for an input image using a forward pass of the neural network; a gradient computation component for computing the gradient of a loss function with respect to parameters of the neural network for the image, based on the predicted labels, in a backward pass of the neural network; a gradient representation generator for extracting a gradient representation of the image based on the computed gradient; an output component which outputs the gradient representation or information based thereon; and a processor in communication with the memory for implementing the prediction component and the gradient computation component.

18. The system of claim 17, further comprising at least one of: a classifier for classifying the input image based on its gradient representation, and a similarity component for computing a similarity between the input image and another image based on respective gradient representations of the image and the other image.

19. A method for extracting a representation from an image, comprising: generating a vector of label predictions for an input image in a forward pass of a pre-trained neural network; computing an error vector based on differences between the vector of label predictions and a standardized prediction vector; in a backward pass of the neural network with the error vector, computing the gradient of a loss function with respect to parameters of the neural network for the image; extracting a gradient representation of the image based on the computed gradients; and outputting the gradient representation or information based thereon, wherein at least one of the computing and the extracting is performed with a processor.

20. The method of claim 19, wherein the standardized prediction vector is an equal-valued vector.